Domain Adaptation in Sequence Analysis

Authors

  • Christian Widmer
  • Daniel Huson
  • Gunnar Rätsch
Abstract

The machine learning approach to problems in computational biology is to learn models from labeled training examples. A common problem in supervised learning is the lack of sufficient training data. Often, however, there exists a large body of data on a different but related problem. To boost performance on our problem of interest, we would like to utilize information from this related domain. The field of domain adaptation formalizes this scenario and provides algorithms for transferring information between domains. The aim of this work is to compare, extend and develop domain adaptation algorithms suited to problems in computational biology. A central part of this work is the evaluation of domain adaptation methods in large-scale computational experiments. To our knowledge, this work constitutes the first thorough experimental comparison of available domain adaptation algorithms in a well-controlled experimental framework. Furthermore, we are among the first to apply such methods to problems in computational biology, a field that appears to offer a wealth of potential applications. As a representative of the class of sequence-based classification problems, we chose splice site prediction, an important subproblem in computational gene finding. In splice site prediction, challenges arise from an extremely high-dimensional feature space, unbalanced data sets and the fact that very little labeled data is available for newly sequenced organisms. To improve performance on the organism of interest, we turn to other organisms as additional sources of information. Experiments were conducted for a set of target organisms of varying phylogenetic distance to the source organism and for multiple data set sizes. Comparing the methods over a range of conditions allows us to draw conclusions about which methods perform well under which circumstances. We observed that the methods that learn solutions on all data sources simultaneously perform best. Furthermore, methods based on simple convex combinations provide the most cost-effective solution to the domain adaptation problem. The largest improvements over baseline methods are observed when the organisms are only distantly related. Using the example of splice site prediction, we show that domain adaptation algorithms are well suited to considerably boosting predictive performance on sequence-based classification problems.
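
The abstract does not spell out the convex-combination methods, so the following is only a minimal sketch of one common reading: the decision scores of a classifier trained on the source organism and one trained on the target organism are mixed with a weight alpha in [0, 1], and alpha is picked on held-out target data. All names here (`convex_combination`, `select_alpha`, the toy scorers and sequences) are hypothetical and chosen for illustration; they are not taken from the work itself.

```python
from typing import Callable, Sequence

# A scorer maps a sequence to a real-valued decision score (> 0 means "positive").
Scorer = Callable[[str], float]


def convex_combination(f_source: Scorer, f_target: Scorer, alpha: float) -> Scorer:
    """Return x -> alpha * f_target(x) + (1 - alpha) * f_source(x)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    def combined(x: str) -> float:
        return alpha * f_target(x) + (1.0 - alpha) * f_source(x)

    return combined


def accuracy(scorer: Scorer, xs: Sequence[str], ys: Sequence[int]) -> float:
    """Fraction of examples whose sign-thresholded score matches the label (+1 / -1)."""
    hits = sum(1 for x, y in zip(xs, ys) if (1 if scorer(x) > 0 else -1) == y)
    return hits / len(xs)


def select_alpha(
    f_source: Scorer,
    f_target: Scorer,
    xs: Sequence[str],
    ys: Sequence[int],
    grid: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> float:
    """Pick the first alpha on the grid with the best held-out accuracy."""
    return max(
        grid,
        key=lambda a: accuracy(convex_combination(f_source, f_target, a), xs, ys),
    )


# Toy stand-ins for trained models (assumptions, not real splice site predictors).
def toy_source(seq: str) -> float:
    return 1.0 if seq.startswith("GT") else -1.0


def toy_target(seq: str) -> float:
    return 0.5 if "GTA" in seq else -0.5


if __name__ == "__main__":
    val_x = ["GTAAGT", "GTCCCC", "CCGTAA"]
    val_y = [1, -1, 1]
    best = select_alpha(toy_source, toy_target, val_x, val_y)
    print("selected alpha:", best)
```

In this reading the only extra cost over training the two base models is a one-dimensional search over alpha, which is one plausible sense in which such methods are inexpensive.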

Similar resources

Sample-oriented Domain Adaptation for Image Classification

Image processing is a method to perform some operations on an image, in order to get an enhanced image or to extract some useful information from it. The conventional image processing algorithms cannot perform well in scenarios where the training images (source domain) that are used to learn the model have a different distribution with test images (target domain). Also, many real world applicat...

Deep Unsupervised Domain Adaptation for Image Classification via Low Rank Representation Learning

Domain adaptation is a powerful technique given a wide amount of labeled data from similar attributes in different domains. In real-world applications, there is a huge number of data but almost more of them are unlabeled. It is effective in image classification where it is expensive and time-consuming to obtain adequate label data. We propose a novel method named DALRRL, which consists of deep ...

Identification of yeast species from uncultivated soils by sequence analysis of the hypervariable D1/D2 domain of LSU–rDNA gene in Kermanshah province, Iran

Yeasts are a polyphyletic group of ascomycete and basidiomycete fungi characterized by having a unicellular growth phase and sexual stages that are not enclosed in fruiting bodies. An attempt was made to identify yeast species in uncultivated soils collected from different areas of Kermanshah province, Iran, by analyzing hypervariable D1/D2 domain of the large subunit (LSU) rDNA gene sequencean...

An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

We study the problem of domain transfer for a supervised classification task in mRNA splicing. We consider a number of recent domain transfer methods from machine learning, including some that are novel, and evaluate them on genomic sequence data from model organisms of varying evolutionary distance. We find that in cases where the organisms are not closely related, the use of domain adaptation...

Designing, Optimization and Construction of Myelin Basic Protein Coding Sequence Binding to the Immunogenic Subunit of Cholera Toxin

Abstract Background and Objectives: Multiple sclerosis (MS) is a chronic inflammatory autoimmune disease. Mucosal feeding of myelin basic protein binding to the cholera toxin B subunit can reduce the intensity of the immune response in MS patients. Expression system, the domain composition of the fusion protein, accessibility of two domains, codon adaptation index (CAI) and GC contents are v...

Molecular analysis of AbOmpA type-1 as immunogenic target for therapeutic interventions against MDR Acinetobacter baumannii infection

Introduction: Acinetobacter baumannii is associated with hospital-acquired infections. Outer membrane protein A of A.baumannii (AbOmpA) is a well-characterized virulence factor which has important roles in pathogenesis of this bacterium. Methods: Based on our PCR-sequencing of ompA gene in the clinical isolates, AbOmpA protein can be categorized into two types, named here type-1 and type-2. We ...

Publication date: 2008